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Abstract 

Important information concerning a multivariate data set, such as clusters and modal 
regions, is contained in the derivatives of the probability density function. Despite 
this importance, nonparametric estimation of higher order derivatives of the density 
functions have received only relatively scant attention. Kernel estimators of density 
functions are widely used as they exhibit excellent theoretical and practical properties, 
though their generalization to density derivatives has progressed more slowly due to 
the mathematical intractabilities encountered in the crucial problem of bandwidth (or 
smoothing parameter) selection. This paper presents the first fully automatic, data- 
based bandwidth selectors for multivariate kernel density derivative estimators. This is 
achieved by synthesizing recent advances in matrix analytic theory which allow math- 
ematically and computationally tractable representations of higher order derivatives 
of multivariate vector valued functions. The theoretical asymptotic properties as well 
as the finite sample behaviour of the proposed selectors are studied. In addition, we 
explore in detail the applications of the new data-driven methods for two other statis- 
tical problems: clustering and bump hunting. The introduced techniques are combined 
with the mean shift algorithm to develop novel automatic, nonparametric clustering 
procedures which are shown to outperform mixture-model cluster analysis and other 
recent nonparametric approaches in practice. Furthermore, the advantage of the use of 
smoothing parameters designed for density derivative estimation for feature significance 
analysis for bump hunting is illustrated with a real data example. 

Keywords: adjusted Rand index, cross validation, feature significance, nonparametric kernel 
method, mean integrated squared error, mean shift algorithm, plug-in choice 
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1 Introduction 



The estimation of density derivatives has full potential for applications. This has been noted 



even in the first seminal papers on density estimation, as Parzen (1962), which was also 



concerned with the estimation of the mode of a unimodal distribution, the value that makes 



zero the first density derivative. In the multivariate case, the pioneering work of Fukunaga 



and Hostetler ( 1975 ) showed how the estimation of the gradient vector can also be used for 



clustering and data filtering, leading to a substantial amount of literature on the subject, 



under the name of the mean shift algorithm. Looking further afield, Cheng (1995) made 



use of the mean shift idea for image analysis, and the highly-cited paper by Comaniciu 



and Meer (2002) showed how these techniques can be useful for low- level vision problems, 



discontinuity preserving smoothing and image segmentation. A further very popular use 
of the mean shift algorithm is for real-time object tracking, as described in | Comaniciu, 



Ramesh and Meer (2003). 



From the perspective of statistical data analysis, in the multidimensional context the 
estimation of the first and second derivatives of the density is crucial to identify significant 
features of the distribution, such as local extrema, valleys, ridges or saddle points. In this 
sense, Godtliebsen, Marron and Chaudhuri (2002) developed methods for determining and 



visualizing such features in dimension two, extending previous work on scale space ideas 



introduced in Chaudhuri and Marron (1999) for the univariate case (the SiZer approach), 



and the same authors also explored the application of this methodology to digital image 



analysis in Godtliebsen, Marron and Chaudhuri (2004). Duong et al. (2008) generalized 



these results for multivariate data in arbitrary dimensions and provided a novel visualization 
for three-dimensional data. These techniques have been widely applied recently in the field 



of flow cytometry; see Zeng et al. (2007), Naumann and Wand (2009) or Pratt et al. (2009) 



Another relatively new problem that is closely related to gradient estimation is that of 
finding filaments in point clouds, which has applications in medical imaging, remote sensing, 



seismology and cosmology. This problem is rigorously stated and analyzed in Genovese et 



al. (2009). Filaments are one-dimensional curves embedded in a point process, and it can 



be shown that steepest ascent paths (i.e., the paths from each point following the gradient 
direction) concentrate around them, so gradient estimation appears as a useful tool for 
filament detection. 

In this paper we focus on kernel estimators of multivariate density derivatives of arbitrary 
order, formally defined in Section [2] below. As for any kernel estimator, it is well known that 
the crucial factor that determines the performance of the estimator in practice is the choice 
of the bandwidth matrix. In the multivariate setting there are several levels of sophistication 



at the time of specifying the bandwidth matrix to be used in the kernel estimator (see Wand 



and Jones, 1995, Chapter 4). The most general bandwidth type consists of a symmetric 
positive definite matrix; it allows the kernel estimator to smooth in any direction whether 
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coordinate or not. This general class of bandwidth matrices can be constrained to consider 
positive definite diagonal matrices, allowing for different degrees of smoothing along each of 
the coordinate axis, or even further to consider a bandwidth matrix involving only a positive 
scalar multiple of the identity matrix, meaning that the same smoothing is applied to every 
coordinate direction. As noted in Wand and Jones (1993) in the density estimation context, 



the single-parameter class should not be used for unsealed data or, as stated by Comaniciu 



and Meer (2002) in terms of feature space analysis, to use this bandwidth class at least the 



validity of an Euclidean metric for the feature space should be previously checked. 

The simpler parameterizations are in general more widely used than the unconstrained 
counterpart for two reasons: first, in practice they need less smoothing parameters to be 
tuned, and second, due to the difficulties encountered in the mathematical analysis of es- 
timators with unconstrained bandwidths. However, Chacon, Duong and Wand (2011) pro- 
vided a detailed error analysis of kernel density derivative estimators with unconstrained 
bandwidths and showed that the use of the simpler parameterizations can lead to a sub- 
stantial loss in terms of efficiency, and that this problem becomes more and more important 
as the order of the derivative to be estimated increases. 



Chacon, Duong and Wand (2011) also proposed an optimal bandwidth selector for 



the normal case, but they did not develop more sophisticated data-driven choices of the 
bandwidth matrix with applicability to more general densities, which is crucial to make 



density derivative estimation useful in practice. Along the same lines, Comaniciu and Meer 



( 2002 ) argue that most existing bandwidth selection methods for the mean shift algorithm, 
all of them for the single-parameter class of bandwidths, are based on empirical arguments. 
In the univariate case there exist some approaches to bandwidth selection for den- 



sity derivative estimation: Hardle, Marron and Wand ( 1990 ) introduced a cross validation 



method and showed its optimality; Jones (1992) derived the relative rate of convergence 
of this method and also for a plug-in proposal; Wu (1997) studied two root n selectors 
in the Fourier domain, and more recently Dobrovidov and Rud'ko (2010) focused on the 
smoothed cross validation bandwidth selector for the density derivative. In the multivariate 
case, however, the issue of automatic bandwidth selection for density derivative estimation 
has remained largely unexplored. Given the smaller body of multivariate density estimation 
research as compared to their univariate cousins, it is not surprising that multivariate den- 
sity derivative estimation suffers equally (if not more so) from a lack of solid results. To the 
best of our knowledge, in the literature the only published approaches to bandwidth selec- 



tion for multivariate kernel estimation of density derivatives are the recent papers Horova 



and Vopatova (2011) and Horova, Kolacek and Vopatova (2013), but both focus exclusively 



on the first derivative. 

This paper proposes three new methods for unconstrained bandwidth matrix selection 
for the multivariate kernel density derivative estimator, and explores their applicability to 
other related statistical problems. In Section [2l we introduce the mathematical framework 
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for the analysis of multivariate derivatives. In Section [3] we show that the relative rate 
of convergence of these unconstrained selectors is the same as for the classes of simpler 
bandwidth matrices, so that from an asymptotic point of view our methods can be as 
successful as (and more flexible than) those needing less smoothing parameters. The finite- 
sample behaviour of the new bandwidths is investigated in Section [4j and their application 
to develop new data-driven nonparametric clustering methods via the mean shift algorithm 
is explored in Section [5j and for feature significance in Section [6j Finally, the proofs of the 
results are given in an appendix. 



2 Kernel density derivative estimation 

The problem of estimating the r-th derivative of a multivariate density is considered in this 
section. From a multivariate point of view, the r-th derivative of a function is understood 
as the set of all its partial derivatives of order r, rather than just one of them. Notice that, 
for instance, in a multivariate Taylor expansion of order r all of the partial derivatives of 
order r are needed to compute the r-th order term. Or, in another related example, all the 
second order partial derivatives are involved in the computation of the Hessian matrix. 

All the r-th partial derivatives can be neatly organized into a single vector as fol- 
lows: if / is a real d-variate density function and x = (x\, . . . , Xd), denote by D = 
d/dx = (d/dxi, . . . , d/dxd) the first derivative (gradient) operator. All the second or- 
der partial derivatives can be organized into the Hessian matrix H/ = (d 2 f /dxidxj)f - =1 , 
and the Hessian operator can be formally written as H = DD T if the usual convention 
{d / dxi){d / dxj) = d 2 / (dxidxj) is taken into account. For r > 3, however, it is not that clear 
how to organize the set containing all the d r partial derivatives of order r. Here we adopt 



the unified approach used in Holmquist (1996a) or Kollo and von Rosen (2005, Section 1.4), 
where the r-th derivative of / is defined to be the vector D® r f = (D/) 0r = d r f /dx® T G R dr . 
In the previous equation D® r denotes the r-th Kronecker power of the operator D; see, 



e.g., Magnus and Neudecker (1999) for the definition of the Kronecker product. Naturally, 
D® / = /, D® 1 / = D/ and, for example, D® 2 = vecH, where vec denotes the operator 
which concatenates the columns of a matrix into a single vector. 

Here we study the problem of estimating the r-th derivative D® r f from a sample 
Xi, . . . , X n of independent and identically distributed random variables with common den- 
sity /. The usual kernel estimator of / is defined as fn( x ) = n ~ l Y2i=i ^n( x — Xj), 
where the kernel K is a spherically symmetric density function, the bandwidth H is a 
symmetric positive definite matrix and K-n(x) = |H| _1//2 i^(H _1 / 2 a;). Thus, the most 
straightforward estimator of D® r f is just the r-th derivative of /h 5 given by D® r fn(x) = 
n l ^27=1 ^® r Kn(x — Xj), where the roles of K and H can be separated for implementa- 
tion purposes by noting that D® r K u (x) = |H|- 1 /2( H - 1 /2)®r D cg>r K ( H -i/2 a ,) ) ag ghown in 



Chacon, Duong and Wand (2011 ), where for any matrix A it is understood that A® = 1 G 
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R and A® 1 = A. See, however, Jones (1994) for other possible estimators in the univariate 
context. 



For the density estimation case (r = 0), Wand and Jones (1993) showed that the use of 



bandwidths belonging to the class X = {h 2 Id : h G M}, with Id the d x d identity matrix, 
or the class V = {diag(/tf , h%, ■ ■ ■ , h^) : h%, fi2, ■ . . , hd £ M}, may lead to dramatically less 
efficient estimators than those based on bandwidth matrices drawn from J 7 , the space of all 



positive definite symmetric matrices. Moreover Chacon, Duong and Wand (2011) showed 



that the issue of efficiency loss is even more severe for r > 1. So the development of 
unconstrained bandwidth selectors for density derivative estimation, which is achieved in 
this paper, may also represent an important improvement in practice. 

To measure the error committed by the kernel estimator for the sample at hand it is 
natural to consider the integrated squared error (ISE), defined as 



ISE r (H) 



\D® r f u (x)-D® r f(x)fdx, 



where || • || denotes the usual Euclidean norm. This quantity depends on the data, so it 
is common to consider the mean integrated squared error MISE r (H) = E[ISE r (H)] as a 
non-stochastic measure of error, and its minimizer HfviiSE.r = argmin Hg jrMISE r (H) as the 
non-stochastic optimal bandwidth choice. A more detailed discussion of the advantages and 



disadvantages of the ISE and MISE viewpoints can be found in Jones (1991). 



Standard calculations lead to the integrated variance plus integrated squared bias de- 
composition MISE r (H) = lV r (H) + ISB r (H), where IV r (H) = / Rd tr Var{D® r f n (x)]dx 
and ISB r (H) = f Rd \\E[D® r f H (x)] - D® r /(^)l| 2 ^- By expanding each of these two terms, 
Chacon, Duong and Wand ( |2011 ) showed that a more explicit form of the MISE is given by 



MISE r (H) = {n- 1 |Hr 1 / 2 tr((H- 1 )® r R(D® r " J R:)) - n' 1 tr R*(K H * K u , D® r /)} 
+ {tr R*(K H * K U , D® r /) " 2trR*(^ H , D® r /) + trR(D®7)} 



(1) 



where R(g) = f Rd g(x)g(x) T dx, and R*(a, g) = J M d(« * s)( x )s{ x ) T dx for a vector valued 
function g and a real valued function a, with a*g standing for a component- wise application 
of the convolution operator. 

Moreover, writing m2{K)l^ = f Rd xx T K(x)dx, under some smoothness assumptions 



Chacon, Duong and Wand (2011) also showed that an asymptotic approximation of the 



MISE is given by 

AMISE r (H) 



n _1 |H| 



-1/2 



+ 



m 2 (K) 2 



tr ((H- x )® r R(D® r i<:)) 
tr ((I d r (g) vec T H)R(D®( r+2 )/)(Id- ® vecH)) 



(2) 



and that the minimizer of this AMISE function, Hamise^ = argmin Hg jrAMISE r (H), 
has entries of order 0(n -2 /( (i+2r+4 )), leading to a minimum achievable AMISE of order 
0(n- 4 /(<*+ 2 r+4)). 
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Although these expressions provide an insightful error analysis of multivariate kernel 
density derivative estimators, they are not directly implementable as software since they 
all involve the unknown density /. In the next section, we examine strategies to estimate 
these unknown quantities which lead to optimal data-based selectors for density derivative 
estimation. 



3 Bandwidth selection methods 

In this section we propose three new methods to select the bandwidth matrix for kernel 
density derivative estimation from the data and study their asymptotic properties. These 
methods are inspired by the cross validation, plug-in and smoothed cross validation method- 
ologies for the estimation of the density in the univariate case, hence their names. 



3.1 Cross validation method 

Cross validation (CV) techniques for bandwidth selection for univariate density estimation 
were introduced in Rudemo (1982) and Bowman (1984), and studied in detail in seminal 
papers like |Hall| ( fl983| ) , |Stone| ( |1984[ ) and |Hall and Marron| jl987[ ). They can be motivated 
in terms of either ISE-oriented or MISE-oriented considerations. 

In the case of multivariate density derivative estimation notice that, for a random vari- 
able X having density / and independent of Xi , . . . , X n , using integration by parts it is 
possible to write 



ISE r (H) 



-l) r vec T I d r 



+ trR(D®7) 



in- 2 D® 2r K H * Kn(Xi - X,) - 2E[D® 2 7h(X)] } 

^ i,i=l J 



provided that the kernel K is sufficiently smooth. The last term is irrelevant for minimizing 
concerns, and the two first terms can be unbiasedly estimated by 



CV r (H 



-l) r vec T I, 



s n n n. 

| n- 2 D 02 ^H * # H (X; - X,) - 2n- 1 £ D^/h^X;) I 

^ i,j=l i=l ' 

(-iyvec T I d ,{n- 2 D^ J ftr H *^H(X i -X J )-2[n(n-l)]- 1 ^D^K H (X l -X J )| 



where D® 2r /H,i denotes the kernel estimator based on the sample with the i-th observation 
deleted. 
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From the MISE point of view notice that, for any smooth enough function L, 
trR*(L H , D®7) = / (Lh* D® r f)(x) T D® r f(x)dx 

= (-If vec T I d r [ D® 2r L n * f(x)f(x)dx 
= (-irvec T VE[D® 2r L H (X 1 -X 2 )], 
so that trR*(Ln, D® r f) can be unbiasedly estimated by 

(-iy[n(„ - l)]" 1 vec T Irfr £ D^Lh(X, - X,). 



Therefore, in view of ([I]), MISE r (H) — trR(D® r '/) can be unbiasedly estimated by 
CV r (H) = n-^Hl-^tr ((H~ 1 )® r R(D l8 "'if)) 

n 

+ (_!)»■[„(„ _ V ec T I dr £ {(1 - n-^D^H - 2D^Kh}(X 4 _ X 



J/! 



where K = K * K. To check that these two CV expressions coincide, take into account 
that D® 2r Kn * K n = D® 2r (Kn * K n ) = D® 2r (K * K) n = D® 2r K n , so that using some 



properties of the Kronecker product and the vec operator (see, e.g., Magnus and Neudecker 



1999), the sum of the diagonal terms in the first CV r (H) expression equals 

{-Ifn- 1 vec T I f D 82r K H (0) = (-lfrT^Hr 1 / 2 vec T l d r{U- l l 2 )® 2r D mr K{0) 

= (-lJV^HI" 1 / 2 vec T (H- 1 )® r D® 2r K(0) 
= n- l \H\- 1 ' 2 vec T (Er 1 )® r vec R(D® r K) 
= rr l \H\- 1 ' 2 tr ((H- 1 )® r R(D®'"ir)) 

where the third line follows by noting that vecR(D (g,r K) = (— l) r L d D® 2r ' K(x)K{x)dx = 
(-iyD® 2r K(0). 

Surely the simplest formulation for CV (useful for implementation purposes) is 

CV r (H) = (-l) r |H|- 1 / 2 vec T (H- 1 )® r ^ n' 2 D® 2r if (HP^X; - Xj)) 

^ i,j=i 

- 2[n(n - l)]" 1 ^ D® 2r K(H" 1 / 2 (X i - Xj)) }. 

We denote by Hcv,r the bandwidth matrix in T which minimizes CV r (H). 
3.2 Plug-in method 

Plug-in (PI) bandwidth selection techniques are based on estimating the unknown quantities 
that appear in an asymptotic error formula and minimizing the resulting empirical criterion. 
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Basic plug-in selectors for univariate density estimation are described in Park and Marron 



(1990) and Sheather and Jones (1991). In the multivariate case, introducing the vector 



integrated density derivative functional, defined as 

</> 2r = / D®*f(x)f(x)dx = (-l) r vecll(p 9r f), 
allows us to rewrite the AMISE formula ([2]) for the r-th derivative as 

AMISE r (H) = n-^Hl -1 / 2 tr ((Er 1 )® r R(D® r iQ) 

(vecH)® 2 ). 



+ ( _ ir n^ Lv ,T +4(vecIdr 



Thus, the plug-in bandwidth selector Hpi jr is defined to be the bandwidth in T minimizing 
the criterion 



Plr(H) = n-^Hr^tr ((H _1 )® r R(D® r iir)) 



+ (-1) 



r m 2 (Kf 



</>2~r+4 ( y ec Irfr ® (vec H)® 2 ) , 



where ip 2 r+A ls a suitable estimator of ^2r+4- 



Chacon and Duong (2010) analyzed the problem of estimating tp 2r f° r an arbitrary r. 



Noting that ip 2r = E[D® 2r /(X)], they proposed the kernel estimator 



^ 2r (G) = n- 1 ^D® 2 7nG(X J 



n 



i=l 



Ed 



® 2r WX I 



Xi), 



using a kernel L with pilot bandwidth G, possibly different from K and H. For the selection 
of the pilot bandwidth matrix G, the same authors showed that the leading term of the 
mean squared error E [||$ 2 r(G) — "02rll 2 ] i s given by the squared norm of the asymptotic 
bias vector 



Wpi2r(G) 



n 



| G |-l/2 (G -l/ 2) ®2r D ®2r L(()) + m^L) q 



2r+2> 



(3) 



so that the asymptotically optimal choice of the pilot bandwidth for the estimation of tp 2r 
is Gpi^r = argminQgjrllujpi^rlG)!! 2 , which depends on ?/> 2r+2 . 

Hence, to select the pilot bandwidth G from the data we could substitute ip 2r+2 by 
another kernel estimator in ^ and minimize the squared norm of the resulting vector, but 
of course then the optimal bandwidth for the kernel estimator of ip2r+2 depends on "0 2r+ 4, 
and so on. The usual strategy to overcome this problem is to use an m-stage algorithm as 



described in Chacon and Duong (2010), involving m successive kernel functional estimations 



with the initial bandwidth matrix chosen on the basis of a normal scale approximation. The 
resulting bandwidth obtained by minimizing PI r (H) when V 2 r+4 i s estimated using an m- 
stage algorithm is called an m-stage plug-in bandwidth selector for the r-th derivative. 
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3.3 Smoothed cross validation method 



The smoothed cross validation (SCV) methodology for univariate density estimation was 



introduced in Hall, Marron and Park: (1992), and a thorough study of this technique was 



made in Jones, Marron and Park (1991). However, it has not been until recently that its 



multivariate counterpart has been developed, in Duong and Hazelton (2005b) and Chacon 



and Duong (2011), and its use for univariate density derivative estimation has been explored 



(see Dobrovidov and Rud'ko, 2010). 

A possible derivation of the SCV criterion for the problem of multivariate density deriva- 
tive estimation is based on the approximation of the MISE obtained by replacing the exact 
integrated variance in equation ([I]) by its asymptotic approximation (the first term), while 
keeping the exact form for the integrated squared bias, so that MISE r (H) ~ MISE2 r (H) 
with 

MISE2 r (H) =n- 1 |H|- 1 / 2 tr((H- 1 ) 0r R(D 0r i^)) + tr R*(A H , D 0r /), 

where Ah = Kh — Kq (here K$ denotes the Dirac delta function) and Ah = Ah * Ah = 
Kh — 2Kn + Kq. The SCV criterion is obtained by replacing the unknown target D® r / in 



the MISE2 formula with a pilot estimator D® r /g(x) 
to 



n 



-i 



Er=i D® r L G (x-Xi), leading 



SCV r (H) 



n 



^Hl" 1 / 2 tr ((H-^Rp^if )) 

n 

+ (-l)V 2 (vec T I rfr ) ^ A H *D« 



'LrJXi-X 



3 )i 



where L = L * L. When all the Xj are distinct and the diagonal terms (i = j) are omitted 
in the previous sum it can be shown, using the properties of the Dirac delta function (see, 



e.g., Gel'fand and Shilov 1964, Chapter 1.2), that the SCV criterion coincides with the CV 



criterion for G = 0. 

The minimizer of SCV r (H) is defined to be Hscv,r- Its value depends on the pilot selec- 
tor G. Chacon and Duong ( 2011 ) showed that in the case r = the leading term of the mean 
squared error E || vec(Hscv> — HMisE,r)|| 2 is given by the squared norm ||^scv,2r+4(G)|| 2 
where <^scv,2r+4(G) is the same as the aforementioned o;pi j 2r+4(G) except that L is re- 
placed by L. Thus it is straightforward to define, analogously to the plug-in algorithm, the 
required optimal k-th stage pilot bandwidth of an m-stage SCV selector. 



3.4 Convergence results 

Let H r = argmin Hg jrMISE r (H) be an arbitrary data-based bandwidth selector, built up 
on the basis of an estimated criterion MISE r (H). Following Duong and Hazelton (2005a), 
H r is said to converge to Hmise r at relative rate n~ a if 



vec(H - Hmise, r) = P (n a J d2 ) vecH M isE/, 
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where Op denotes element- wise order in probability and 3^2 is the d 2 x d 2 matrix of ones. 
This order in probability statement can be difficult to derive directly. The next lemma 
provides a more tractable indirect method of calculating convergence rates. 

Lemma 1. Suppose that assumptions (Al)-(A3) given in the appendix hold. The discrep- 
ancy vec(H r — Hmise.t-) is asymptotically equivalent to DEMISE,.— MISE r ](HMisE,r)> where 
Dh is shorthand for d/dvecH. Furthermore, the relative rate of convergence o/H r is n~ a 
*f 

E{D H [MiSE,. - MISE r ](H M isE,r)}IE{DH[MISEr - MISE r ](H M i S E,r)} T 

+ Var{D H [MISE r - MISE r ](H M i S E,r-)} = 0(n~ 2a J d2 ) vec H M i S E,r vec T H M i S E,r. 

The convergence rates of the three bandwidth selectors considered here are given in the 
following theorem, whose proof is deferred to the appendix. 

Theorem 1. Suppose that assumptions (Al)-(A5) given in the appendix hold. The relative 
rate of convergence to HMisE.r is n - rf /( M + 4r + 8 ) f or the cross validation selector Hcv, r , and 
n -2/(d+2r+6) j Qr ^ e pi U g_i n selector Hpi,r and the smoothed cross validation selector Hscv> 
when d > 2. 



Jones ( 1992 ) computed the relative rate of convergence for the CV and PI selectors for 
the estimation of a single partial derivative, using a single-parameter bandwidth matrix (i.e., 
Hel). The previous theorem shows that the unconstrained CV bandwidth attains the 
same rate as its constrained counterpart, yet with added flexibility that should be captured 
in the constant coefficient of the asymptotic expression, although the computation of an 
explicit form for this coefficient does not seem possible in general. 

The convergence rate of the PI selector is n -C 2 + rnm { 2 > d / 2 })/( d +' 2r +6) within the single- 
parameter bandwidth class I, yielding a slightly faster convergence to the optimal con- 



strained bandwidth. As explained in Chacon and Duong (2010, 2011) for the density case, 



this is due to the fact that the very special cancellation in the bias term which is achievable 
when using a single-parameter bandwidth is not possible in general for the unconstrained 
estimator. Nevertheless, the aforementioned papers showed that this slight loss in con- 
vergence rate terms is negligible in practice as compared with the fact that the targeted 
constrained optimal bandwidth is usually much less efficient than the unconstrained one 
(see also Section [4] below) . 

Theorem [l] also shows that the similarities noted in Chacon and Duong (2011) about 



the asymptotic properties of the PI and SCV methods for the density estimation problem 
persist for r > 0, since both selectors exhibit the same relative rate of convergence. 



Jones, Marron and Sheather ( 1996 , p. 406) exemplified how slow is the rate n~ 1 ' 10 of the 
CV selector for d = 1, r = by noting that n has to be as large as 10 10 = 10, 000, 000, 000 
so that n -1 / 10 = 0.1. In the same spirit, to compare the rates obtained in Theorem [TJ 
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d = 2 d = 3 d = 4 d = 5 

CV PI/SCV CV PI/SCV CV PI/SCV CV PI/SCV 



r 


= 


n = 


10 3 


1.000 


0.562 


0.720 


0.681 


0.562 


0.794 


0.464 


0.901 






n = 


10 4 


0.681 


0.316 


0.439 


0.408 


0.316 


0.501 


0.245 


0.593 






n = 


10 5 


0.464 


0.178 


0.268 


0.245 


0.178 


0.316 


0.129 


0.390 


r 


= 1 


n = 


10 3 


1.334 


0.794 


1.000 


0.901 


0.794 


1.000 


0.658 


1.093 






n = 


10 4 


1.000 


0.501 


0.681 


0.593 


0.501 


0.681 


0.390 


0.767 






n = 


10 5 


0.750 


0.316 


0.464 


0.390 


0.316 


0.464 


0.231 


0.538 


r 


= 2 


n = 


10 3 


1.585 


1.000 


1.233 


1.093 


1.000 


1.179 


0.838 


1.259 






n = 


10 4 


1.259 


0.681 


0.901 


0.767 


0.681 


0.848 


0.538 


0.926 






n = 


10 5 


1.000 


0.464 


0.658 


0.538 


0.464 


0.611 


0.346 


0.681 



Table 1: Comparison of the relative rate of convergence for the CV, PI and SCV selectors. 
For each combination of r, n and d in the table, the left entry in the corresponding cell 
shows n -d/(2d+4r+8) ( C y se i ector ) and the right entry n -2/(d+2r+6) ( PI and g C y se i ectors ) 

divided by 1000" 1/6 (i.e. the rate for the CV selector with n = 1000, d = 2, r = 0). 



Table |l| shows the values of n -d/(2d+4r+8) ( C y) and n -2/(d+2r+6) ( PI and SCV ) divided by 
lOOO -1 / 6 , that is the rate for the CV selector n = 1000, d = 2, r = which is used as a base 
case, for all the different combinations of n = 10 3 ,10 4 ,10 5 , d = 2,3,4,5 and r = 0,1,2. 
Ratios which are lower than 1 indicate the rate is faster than the base case, and ratios 
greater than 1 a slower rate. For n = 10 3 , these ratios in Table [l] tend to be greater than 1, 
indicating that using this sample size will lead to a deteriorating convergence rate. On the 
other hand for the larger sample sizes, n = 10 4 , 10 5 , these ratios tend to be less than 1. This 
implies that convergence rates better than the CV selector for bivariate density estimation 
can be attained, even with higher dimensions and higher order derivatives, provided that 
sufficiently large (although still realistic) sample sizes are used. Of course this comparison 
only takes into account the asymptotic order of the convergence rates by ignoring the 
associated coefficients since explicit formulas for the latter are not available for d>2. The 
finite sample behaviour of the bivariate case for moderate sample sizes is examined more 
closely in the next section. 

4 Numerical study 
4.1 Data-based algorithms 

For most practical implementations the normal kernels are used, i.e. K = L = <p. For d x d 
symmetric matrices A, B, and for r, s > 0, let 

mrM^ A, B, S) = [(vec T A)® r 8) (vec T B)® s ]D® 2r+2s fc(x) 
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and write, for short, rj2 r (x;'E) = r]2 r ,o(x; Id) Id) S) and i/ r (S) = (-l) r 7?2r(0; £)/0s(O). 
Then the cross validation criterion can be rewritten as 

CV r (H) = (-l) r {n~ 2 mrOti - X j; 2H) - 2[n(n - l)]" 1 ^ ^(X, -X i5 H) j. 

Besides, the data-based m-stage selection algorithm for plug-in selectors is given by: 
1. Initialize the m-th stage pilot selector to be the normal reference selector 

2 



"PI,2r+2m+2 



2r + 2m + d + 2 



2/(2r+2m+d+4) 



2g n -2/(2r+2m+ci+4) 



from Chacon, Duong and Wand (2011 ), where S is the sample variance of Xi, . . . X n . 

2. For k = m — 1, m — 2, . . . , 1, the optimal fe-the stage pilot selector Gpi i 2r+2fc+2 is the 
minimizer of 



2 = n" 2 |G|- 1 (2^)- d OF(2r + 2k + 2)u r+k+l {G- 2 ) 
+ (-l) r + fc+1 (2vr)- d / 2 OF(2r + 2fc + 2)|Gr 1/2 n~ 3 

n 

x 7 ?2,2r+2fe+2(Xi — XjJ G, G X , Gpi ) 2 r +2ft+4) 



+ 



i 71 4 [ r /2,2r+2fe+2(Xj — XjJ G, Irf, Gpi 5 2r+2fc+4) 

where OF(2p) = (2p— l)(2p — 3) • • • 5 • 3- 1 for p E N. The numerical minimization over 



the class of positive-definite matrices is carried out as described in detail in Duong 



and Hazelton (2005b, Section 5.1). 



3. The plug-in selector Hpi jr is the minimizer of 
PI r (H) = n^lHr^-^V^VtH- 1 ) 

n 

+ (-l) r (2n)- 2 %,2r(X i -X J -;H,I <lj Gpi,2r+4). 
The derivations of ||£pi,2r+2fc+2(G)|| 2 and PI r (H) in the rj functional form can be found 



m 



Chacon and Duong (2012). There it is also shown that, although it appears that these 



are less concise than the previous expressions, they facilitate efficient computation, both in 
terms of memory and execution time. 

We observe that ||^scv,2r+2fc+2|| 2 is the same as ||^pi,2r+2fe+2|| 2 except the three terms 
are multiplied by 2~ d ,2~ d / 2+1 and 4 respectively; since 0(0) = 02i(O) = 2~ d / 2 0(O) and 
r71 2(0) = 2m2(0). Furthermore, 

n 

SCV r (H) = n- 1 |Hr 1 / 2 2~( d+r )vr- d / 2 ^(H- 1 ) + (-lfrT 2 ^ [^(X* - X i5 2H + 2G) 
- 2r ?2 r(X i - Xj-; H + 2G) + ^(X - X r , 2G)] . 
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So a data-based m-stage SCV selector is obtained from straightforward modifications of the 
PI selector algorithm above. 

4.2 Simulation study 

The bandwidth selectors included in our simulation study were 

• OR: oracle, i.e. the minimizer of the MISE for the target density 



NR: normal reference from [Chacon, Duong and Wand (2011, Theorem 6), which is 



equal to [4/(d + 2r + 2 )] 2 /( d+2r+ ^Sn- 2 / ( - d+2r+ ^ 
CV: cross validation from Section [3.11 



PI: plug-in with 2-stage unconstrained pilots from Section 3.2 



SCV: smoothed cross validation with 2-stage unconstrained pilots from Section 3.3 



We have developed efficient implementations of all these selectors and incorporated them 
into the existing R library ks ( Duong , 2007[ ) . The target bivariate normal mixture densities 
that we considered are displayed in Figure [TJ Their explicit definitions can be found in 



Chacon (2009). 



Density #1 Uncorrelated Normal 



Density #6 Bimodal 



Density #7 Separated Bimodal 






-3-2-10 1 2 3 
Density #8 Asymmetric Bimodal 



-3-2-10 1 2 3 
Density #11 Double Fountain 



-3-2-10 1 2 3 
Density #12 Asymmetric Fountain 






Figure 1: Target bivariate normal mixture densities 
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For each selector and target density and for r = 0,1,2, we generated 100 samples of 
size n = 1000. The integrated squared error (ISE) between the resulting density estimate 
and the target density was computed as our measure of performance. The box plots of the 
log ISE are shown in Figure [2] We also conducted the study for sample sizes n = 400 and 
n = 4000 but the conclusions extracted were much the same as for n = 1000 so we decided 
not to include these results here to avoid redundancies. 

By construction, the oracle selector (OR) is the best possible selector in terms of MISE 
given that the true target normal mixture density was used for its computation. As ex- 
pected, it also has the uniformly lowest ISE. The normal reference selector (NR) was the 
only data-based selector previously available in the literature, and the results show that it 
is suitable only for density #1 since its ISEs for the other densities are uniformly higher 



than those of the other selectors. In line with other published simulation studies (Cao, 



Cuevas and Gonzalez-Manteiga 1994, Jones, Marron and Sheather, 1996), the CV selector 



displays larger variability in the ISEs than the PI and SCV selectors, though the former 
presents lower mean ISEs in some cases, e.g. density #12, r = 0, 1. We note also that 
the CV variability tends to increase with increasing r, whereas this is not observed for the 
two other hi-tech selectors. An anonymous referee drew our attention to the low variability 
of the introduced bandwidth selectors for this density #12, as compared with that of the 
oracle. This density has very complicated features, like modal regions of different shape and 
size, so this is the scenario where usually oversmoothing occurs, and we checked that this is 
indeed the case: the oracle tries hard to discover the true structure (hence its high variabil- 
ity), whereas all the data-driven bandwidths tend to consistently prefer a more conservative 
estimate, slightly oversmoothed. Given that the construction and theoretical properties of 
the PI and SCV selectors are similar, it is not surprising that their ISE performance is 
correspondingly similar for all the cases examined here. Either of these selectors would thus 
be our recommendation over the CV and NR selectors. 

5 Applications to mean shift clustering 



The so-called mean shift algorithm (Fukunaga and Hostetler, 1975) is an iterative procedure 



which, at every step, shifts the point obtained in the previous iteration in the direction of 
the density gradient, producing a convergent sequence that transports any initial value to 
a local maximum of the density along the steepest ascent path. 

Specifically, the mean shift clustering algorithm can be described as follows: an initial 
point Yo is transformed recursively to obtain a sequence defined by 

Y J+1 =Y,+AD/(Y,)//(Y,), (4) 

where / is an arbitrary density estimator, D/ is an estimator of the density gradient, 
and A is a fixed d x d positive definite matrix, properly chosen to guarantee convergence 
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Figure 2: Box plots of the logarithm of the ISEs for bandwidth selectors for n = 1000, r = 
0, 1, 2 for the six bivariate target densities. 
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of the sequence (Yo,Yi,...). This is easily recognized as a variant of the well-known 
gradient ascent algorithm employed to find the local maxima of a given function, but using 
the normalized density gradient (i.e., the density gradient divided by the density itself) 
instead of just the gradient in its definition. The advantages of using such a normalization 



are illustrated in Fukunaga and Hostetler (1975), Cheng (1995) and Comaniciu and Meer 



(2002); one of them is to accelerate the convergence of the resulting sequence for initial 



values of low density. 

When kernel estimators are used in Q the above procedure attains a particularly simple 
form. Assuming that the kernel K is a spherically symmetric function it follows that 
K(x) = |/c(||;e|| 2 ), where the function k: M + — > M is known as the profile of K. Under the 
usual conditions that K is smooth and unimodal, its profile is decreasing so that g(x) = 



—k'(x) > 0. Therefore, noting that DK(x) 
estimator can be written as 



-a;g(||cE|| 2 ), the kernel density gradient 



D/h (as) = n-^Hr^H" 1 J2(X { - x)g((x - X^HT^x - X*)) = H^ 1 fu(x)m M (x) , 

i=l 

(5) 

where fn(x) = n _1 |H| -1 / 2 Y17=i s{( x ~ Xj) T H _1 (a; — Xj)) can be understood as an un- 
normalized kernel estimator of / and the term 



x 



m H {x) 



XQ) 

E^x^^-x^h-Hx-x,)) 



X 



is known as the mean shift. Thus, equation (|5j) can be re-arranged to note that H~ mn(x) 
provides a reasonable estimator of the normalized density gradient, and by taking A = H 
in equation Q it leads to the recursively defined sequence 

Er = iX5((Y,-X. t ) T H- 1 (Y J -X)) 



Y i+1 = Yj + m H (Y i ) 



EIU^y.-x^h-hy.-xo) 



(6) 



When A; is a convex and monotonically decreasing profile, and H = /i 2 Id, Comaniciu and 
Meer (2002, Theorem 1) showed that the sequence (Yo, Yi, . . .) defined in this simple way 
converges to a local maximum of /h, and their proof can be easily adapted to cover the 
case of an unconstrained H as well. The recursive formulation ^ was also motivated as 



an EM- type algorithm for mode finding in Li, Ray and Lindsay (2007), who proved its 
convergence under more general conditions. 

Since the direction along which the data points are shifted, as well as the limit points 
of the sequences of successive locations (i.e., the solutions of Dfn(x) = 0), are directly 
related to the density gradient, our proposal is to take H in the mean shift algorithm as a 
bandwidth matrix selector for multivariate kernel density gradient estimation, using any of 
the methods introduced in Section [3| This choice is also supported by the results in Grund 
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and Hall 


(1995 


) and 


Vieu 


(1996) 



for estimating the mode of a density is closely related to the problem of density derivative 
estimation. Thus, the bandwidth choice is made with the goal of optimal identification 
of the density features in mind. This is in contrast with other proposals, as for example 



Comaniciu (2003), where a different criterion is taken into account to obtain an automatic 
variable-bandwidth selection algorithm. 

When only a few iterations of the mean shift algorithm are performed, it is probably 
the case that convergence has not been reached yet. However, the procedure is still useful 



for other tasks. These include data filtering (Fukunaga and Hostetler, 1975), which seeks 



to reduce the effect of noise in the determination of the geometric properties of a data set 



or in finding local principal curves, and also data sharpening (Choi and Hall 1999, Hall 



and Minotte , 2002 ) , which can be used to reduce the bias in kernel curve estimation and to 
adapt kernel estimators to pre-specified curve constraints. 

The main statistical application of the mean shift procedure is for cluster analysis. 



For several additional applications in engineering, see Cheng (1995), Comaniciu and Meer 



(2002), Comaniciu, Ramesh and Meer (2003). When the mean shift algorithm is applied 



with any of the data points as starting value it induces a partition of the data in a natural 
way, by assigning the same cluster to all the data points that converge to the same local 



maximum. This is called modal clustering in Li, Ray and Lindsay (2007). Notice that 
this methodology does not require the number of clusters to be specified in advance, and 
that it allows clusters of arbitrary shape to be discovered. Moreover, since the mean shift 
algorithm can be applied with any starting point, it does not produce only a partition of 
the data, but a partition of the whole space. 

To illustrate the use of mean shift clustering Figure [3] shows the result of applying the 
mean shift algorithm to a sample of size n = 210 from a trimodal normal mixture (the one 



labeled Trimodal III in Wand and Jones (1993)). The black bold stars show the location 
of the three modes found and the paths in grey starting from every data point depict their 
ascent towards their associated density mode. 



5.1 Simulation results 

As pointed out above, any of the bandwidth selection methods for kernel density gradient 
estimation introduced in Section [3] leads automatically to a new nonparametric clustering 
procedure via the mean shift algorithm. To explore the finite sample properties of these 
new proposals, their performance is compared here to other related existing methods. 

Given the enormous amount of literature on clustering techniques, it would be impossible 
to include all the different clustering procedures in this comparison, so a brief selection of 
techniques similar to the one introduced here have been considered: 



CLUES algorithm (CLUstEring based on local Shrinking), proposed in Wang, Qiu and 



IS 




-2 



Figure 3: Paths followed by the sample points as a result of the application of the mean 
shift algorithm. Sample of size n = 210 from a trimodal normal mixture density. 



Zamar (2007), is an iterative algorithm closely related to the mean shift algorithm, 



but in which the shift is performed at each iteration by computing the coordinate-wise 
median of the K- nearest neighbors of the previous iteration point. 



PDFC algorithm (PDF Clustering), proposed in Azzalini and Torelli (2007), is also 
based on a kernel density estimate. Its high density regions are computed and the 
connected components of this regions are identified as sample clusters. The bandwidth 
used in the kernel estimate is just a diagonal normal scale rule-of-thumb for the density 
(not the density gradient), multiplied by a subjectively chosen shrinkage factor 3/4 to 
correct for oversmoothing. We are aware of the existence of other clustering methods 



based on high density regions, as for instance Cuevas, Febrero and Fraiman (2001) 
or Rinaldo and Wasserman (2010), but decided to include in this admittedly limited 



study only the PDFC algorithm due to its simplicity. 



MCLUST algorithm (Mixture model CLUSTering), as surveyed in Fraley and Raftery 



(2002), is included in the comparison since it can be recognized as the parametric 



golden standard. 

These three methodologies are compared with mean shift clustering using unconstrained 
bandwidth matrices for density gradient estimation obtained with: 1) the normal scale rule 
derived in |Chacon, Duong and Wand ( 2011| ) (labeled NR), 2) the cross-validation bandwidth 
(labeled CV), 3) the plug- in bandwidth (labeled PI), and 4) the smoothed cross-validation 
bandwidth (labeled SCV). 

The comparison is made along five test clustering problems, generated by five bivariate 
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mixture densities that have been chosen to investigate the performance of the methods 
in a typical parametric setup (two normal mixture densities) and in situations with non- 
ellipsoidal cluster shapes, having also different scales. Specifically, the five mixture densities 
in the study are: 



1. Trimodal III density from Wand and Jones (1993). 



2. Quadrimodal density from Wand and Jones (1993) 



3. 4-crescent model. This model is intended to mimic the distribution explored in Figure 



7 of Comaniciu ( 2003 ) . Since an explicit expression of the density function is not given 



there, our model has been generated as a suitable modification of Experiment 4 in 



Fukunaga (1990, p. 546). Namely, a bivariate random vector X is defined to have 
a crescent distribution with center O £ M 2 , radius r > and convexity indicator 
k G {0,1}, denoted C(0,r,«) if X = O + (r cos 6, (-l) K r sin6) T + U, where G 
is normally distributed with mean 7r/2 and variance (7r/6) 2 and U is a bivariate 
centred normal vector with variance matrix (r/20) 2 l2- Then, the 4-crescent model is 
the equally weighted 4-component mixture density with components C((— 1, 1) T , 1, 1), 
C((0, 0.5) T , 1, 0), C((0, 0) T , 0.5, 1) and C((0.5, -0.5) T , 0.5, 0). 

4. Broken ring model. This model aims to reproduce the sampling scheme shown in 



Figure 3 in Wang, Qiu and Zamar (2007). Precisely, a bivariate random vector X 



is defined to have a standard half-crescent distribution with mean angle 6, denoted 
HC{6) if X = (cos0,sinO) T + U, where is normally distributed with mean and 
variance (7r/12) 2 and U is a bivariate centred normal vector with variance matrix 
(1/20) 2 I2- Then, the broken ring model is the 5-component mixture density having a 
centred normal component with variance (1/5) 2 I2 and weight 1/4, and four standard 
half-crescent components with equal weights 3/16 and mean angles vr/4, 37r/4, 57r/4 
and 77r/4, respectively. 

5. Eye model. This model is a variation of the former. It is also a 5-component mix- 
ture density with a centred normal component with variance (1/5) 2 I2 as before, but 
with a weight 1/20. The other 4 components are centred crescent distributions (i.e., 
O = (0,0) T ), two of them with radius 1 and the two possible convexity indicators, 
respectively, having weight 1/8 each; and the other two with radius 1.5 and also the 
two possible convexity indicators, but with weight 7/20 each, and rotated 90 degrees. 

A clearer picture of all these models is provided by Figure |4j which shows samples of size 
n = 800 for each of them. 



In common with Azzalini and Torelli (2007), Wang, Qiu and Zamar (2007) and many 



others, the performance of each clustering method is measured through the adjusted Rand 
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Figure 4: Samples of size n = 800 from each of the models considered in the clustering sim- 
ulation study. Different cluster membership is indicated with different plotting characters 
and colours. 



index (ARI), which was introduced by Hubert and Arabie (1985) as a corrected- for-chance 
version of the proportion of agreements between two partitions of a given data set. This 



index is the overall preferred accuracy measure in the simulation study of Milligan and 



Cooper (1986). An ARI value of 1 indicates that all estimated memberships are the same 



as the true memberships, whereas a value close to indicates that the estimated cluster 
assignation does not differ much from random assignment. For the comparison, 100 samples 
of size 500 were drawn from each of the five test models, the data were clustered according 
to the seven methods in the study (four mean shift procedures plus CLUES, PDFC and 
MCLUST) and the ARI was computed to measure the performance of each method for each 
of these data sets. Table [2] presents the average ARI values obtained. 

In view of Table [2j none of the methods compared is uniformly the best. In the group 
of the mean shift procedures, the use of the PI bandwidth seems to exhibit the best overall 
performance. The CV choice can be rated second best, with similar or even slightly (but not 
significantly) better average ARI in some cases. The SCV bandwidth shows an unexpectedly 
inferior performance for the normal mixture models, but it has an acceptable behaviour for 
the models with non-standard cluster shapes. Finally, the normal scale rule NR is clearly 
inferior in four out of the five models, but it performs surprisingly well for the broken ring 
model; since it is the least intensive method in computational terms, it could be useful at 
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NR 


CV 


PI 


SCV 


CLUES 


PDFC 


MCLUST 


Trimodal III 


0.700 


0.694 


0.752 


0.546 


0.583 


0.754 


0.715 


Quadrimodal 


0.518 


0.617 


0.630 


0.505 


0.519 


0.641 


0.790 


4-crescent 


0.569 


0.920 


0.913 


0.932 


0.805 


0.834 


0.481 


Broken ring 


0.983 


0.918 


0.983 


0.986 


0.984 


0.975 


0.811 


Eye 


0.606 


0.742 


0.765 


0.585 


0.548 


0.544 


0.420 



Table 2: Average adjusted Rand index (ARI) for 100 simulation runs of size n = 500 of 
each distribution. 



least to provide a quick initial analysis, especially in higher dimensions. 

The comparison with the parametric method MCLUST followed the expected guidelines: 
for the normal mixture models MCLUST showed good results, especially for the difficult 
quadrimodal density, but it seems unable to adapt itself to the non-standard cluster shape 
situations. On the contrary, CLUES is not very powerful for a standard setup with ellip- 
soidal clusters, but seems to performs reasonably well for non-standard problems. Finally, 
PDFC shows remarkable results in the simulation study, in spite of the ad hoc choice of the 
bandwidth in which it is based, and its performance is comparable to that of the best mean 
shift procedure, with the only exception of the eye model. Surely a more careful study of 
the bandwidth selection problem would improve the quality of the PDFC method further. 

5.2 Real data examples 

The mean shift algorithm in conjunction with the new proposed bandwidth selection rules 
was also applied to some real data sets. It is well-known that the kernel density estimator 
tends to produce spurious bumps (i.e., unimportant modes caused by a single observation) 
in the tails of the distribution, and that this problem seems enhanced in higher dimensions, 
due to the empty space phenomenon and the curse of dimensionality (see, for instance, 



Simonoff, 1996, Chapter 4). For real data sets, this may result in a number of data points 
forming singleton clusters after applying the mean shift algorithm. 

Furthermore, in some applications the researcher may be interested in forming more 
homogeneous groups so that, say, insignificant groups of size less than a% of the biggest 
group are not allowed in the outcome of the clustering algorithm. This goal can be achieved 
as follows: apply the mean shift algorithm to the whole data set and identify all the data 
points forming groups of size less than a% of the biggest group, then leave those singular 
data points out of the estimation process in the mean shift algorithm and re-compute the 
data-based bandwidth and the density and density gradient estimators in Q using only 
non-singular data points. Since the mean shift algorithm produces a partition of the whole 
space, these left-out data points can be naturally assigned to any of the corresponding newly 
obtained clusters. If this new assignment again contains insignificant clusters then iterate 
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the process until the eventual partition satisfies the desired requirements. This correction 
is similar (although a little different) to the stage called "merging clusters based on the 
coverage rate" in Li, Ray and Lindsay (2007), and will be referred henceforth as correction 



for insignificant groups. 



5.2.1 E.coli data 



The E. coli data set is provided by the UCI machine learning database repository (Prank 



and Asuncion, 2010). The original data were contributed by Kenta Nakai at the Institute 



of Molecular and Cellular Biology of Osaka University. The data represent seven features 
calculated from the amino acid sequences of n = 336 E.coli proteins, classified in eight 
classes according to their localization sites, labeled imL (2 observations), omL (5), imS (2), 
om (20), pp (52), imU (35), im (77), cp (143). A more detailed description of this data 



set can be found in Horton and Nakai (1996). Since two of the original seven features are 
binary variables, only the remaining five continuous variables {d = 5), scaled to have unit 
variance, were retained for the cluster analysis. 

The number of groups identified by the mean shift procedure with correction for in- 
significant groups (using a = 5% as a default) was 5 for PI and SCV bandwidths, which 
is the natural choice if the insignificant clusters imL, omL and imS are merged into bigger 
groups. The mean shift algorithm found 6 groups using the NR bandwidth and 7 with the 
CV bandwidth. Since in this example the true cluster membership is available from the 
original data, it is also possible to compare the performance of the methods using the ARI. 
The ARIs for these configurations were 0.63 (NR bandwidth), 0.671 (CV), 0.667 (PI) and 
0.559 (SCV). In contrast, CLUES and PDFC indicated a severely underestimated number of 
groups in the data, namely 3 and 2, respectively, and whereas CLUES obtains a remarkably 
high ARI anyway (0.697), the performance of PDFC is poor for this data set in ARI terms 
(0.386). MCLUST also gives a reasonable answer, with 6 groups and an ARI of 0.642. 



5.2.2 Olive oil data 



These data were introduced in Forina et al. (1983), and consist of eight chemical measure- 
ments on n = 572 olive oil samples from three regions of Italy. The three regions Rl, R2 
and R3 are further divided into nine areas, with areas Al (25 observations), A2 (56), A3 
(206) and A4 (36) in region Rl (totalling 323 observations); areas A5 (65) and A6 (33) in 
region R2 (totalling 98); and areas A7 (50), A8 (50) and A9 (51) in region R3 (totalling 
151). Detailed cluster analyses of this data set are given in |Stuetzle| fl2003D and |Azzal mi 



and Torelli (2007). Taking into account the compositional nature of these data, they were 



transformed following the guidelines in the latter reference, first dealing with the effect of 
rounding zeroes when the chemical measurement was below the instrument sensitivity level 
and then applying the additive log-ratio transform to place the data in a 7-dimensional 
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Euclidean space (see Pawlowsky-Glahn and Buccianti, 2011, for a recent monograph on 



compositional data). Then, cluster analysis was carried out over the first five principal 
components of the scaled Euclidean variables. 

The results of the analysis indicated that whereas some methods seemed to target the 
partition of the data into major regions, others tried hard to discover the sub-structure of 
areas. This was clearly recognized when the ARIs of the groupings were computed either 
with respect to one classification or the other. Naturally, if a method produced a grouping 
which was accurate with respect to major regions, it had lower ARI with respect to the 
division into areas. 

CLUES, PDFC and the mean shift algorithm using the NR bandwidth clearly favoured 
grouping the data into major categories. The PDFC method obtained a remarkable ARI of 
0.841 by clustering the data into 3 groups, whereas CLUES only found 2 groups resulting 
in an ARI of 0.680. Using the NR bandwidth the mean shift algorithm achieved an ARI of 
0.920 with respect to the true grouping into major regions; it correctly identified all the data 
points in regions Rl and R2, although region R3 appeared divided into several subregions. 

In contrast, MCLUST and the mean shift algorithm combined with all the more so- 
phisticated bandwidth selectors tended to produce groupings closer to the assignment into 
smaller areas. MCLUST showed the existence of 8 groups and achieved an ARI of 0.739 
with respect to the true distribution into areas. The mean shift analyses with the CV, PI 
and SCV bandwidths all found 7 groups, leading to ARIs of 0.741 (CV bandwidth), 0.791 
(PI) and 0.782 (SCV). 

6 Applications to bump-hunting with feature significance 

It is not always easy to interpret visually estimates of multivariate derivatives. To assist 



us, we use the significant negative curvature regions of Duong et al. (2008), defined as 
the set containing the values of x £ W 1 such that the null hypothesis that the Hessian 
H/(cc) is positive definite is significantly rejected. The appropriate kernel test statistic, 



null distribution and adjustment for multiple testing is outlined in Duong et al. (2008) and 
implemented in the feature library in R. Significant negative curvature regions corresponds 
to a modal region in the density function, and hence a local maxima in data density. These 
authors focused on the scale space approach of smoothing and so did not develop optimal 
bandwidth selectors for their density derivative estimates. 

Here, we compare the significant curvature regions obtained using a usual r = band- 
width selector to those with an r = 2 optimal bandwidth in Figure [5] on the earthquake data 



from Scott (1992). The recorded measurements are the latitude and longitude (in degrees) 



and depth (in km) of epicenters of 510 earthquakes. Here, negative latitude indicates west 
of the International Date Line, and negative depth indicates distances below the Earth's 
surface. The depth is transformed using -log(-depth). For these transformed data, we use 
PI selectors Hp^o and Hpi^ and SCV selectors Hgcv,o and Hgcv,2- 
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As expected from asymptotic theory, bandwidths for Hessian estimation are larger in 
magnitude than bandwidths for density estimation. Moreover only the central modal region 
is present using Hp^o, whereas with Hpi.2, the three local modal regions are more clearly 
delimited from the surrounding space, confirming the three modes obtained with subjective 
bandwidth selection by Scott (1992). 
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Figure 5: Significant negative curvature regions (in blue). (Upper left) Plug-in selector 
r = 0. (Upper right) Plug-in selector r = 2. (Lower left) SCV selector r = 0. (Lower 
right) SCV selector r = 2. The significant curvature regions or modal regions are more 
clearly delimited from the surrounding scatter point cloud with the selectors corresponding 
to second derivative. 
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A Appendix: proofs 

Henceforth the following assumptions are made: 

(Al) K is a symmetric c?-variate density such that J xx T K(x)dx = m2(if)Id and all its 
partial derivatives up to order 2r + 1 are bounded, continuous and square integrable. 

(A2) / is a density function with all its partial derivatives up to order 2r + 6 bounded, 
continuous and square integrable. 

(A3) H = H n is a sequence of bandwidth matrices such that all entries of n *|H| 1//2 (H i)® r 
and H tend to zero as n — > oo. 

These do not form a minimal set of assumptions, but they serve as useful starting point 
for the results that we subsequently develop. Besides, in this section integrals without 
any integration limits are assumed to be integrated over the appropriate Euclidean space. 
We also assume that suitable regularity conditions are satisfied so that the exchange of 
term-by-term integration and differentiation of Taylor expansions are well-defined. 



Proof of Lemma [7J Reasoning as in Lemma 1 in Duong and Hazelton ( 2005a ) , it follows that 
vec(H r — Hmise.t-) is asymptotically equivalent to — [DEMISE,. (H]ynSE,r)] _1 DH[MISE r — 
MISE r ](HMisE,r)) where = d 2 /{d vec H<9 vec T H) denotes the Hessian operator corre- 
sponding to Dh- Therefore, it suffices to show that D^MISE r (HMisE,r) = 0(J d 2). But 



for any H with entries of order 0(n 2 /(d+2r+4)^ ag Hmise,t-> the results in Chacon, Duong 



and Wand (2011) imply that MISE r (H) is equivalent to AMISE r (H) and, moreover, the 
smoothness assumptions ensure that D^jMISE r (H) is of the same order as D 2 ^ AMISE r (H). 
And it is not hard to show that for the asymptotic integrated squared bias term we have 
D^j{i/jJ r+4 ( veclrfr (g) (vecH)® 2 )} = 0(J d 2) and similarly for the asymptotic integrated 
variance, thus finishing the proof. □ 

A.l Convergence rate for the CV bandwidth 

Lemma [l] shows that vec(Hcv,r — Hmise,t) is asymptotically equivalent to Dh[CV,. — 
MISE r ](H M isE,r)- Since E[CV r (H)] = MISE r (H) - trR(D® r /) for all H, it follows that 
the order of vec(Hcv,r — Hmise,?-) is given by the (root) order of 

Var {D H [CV r - MISE r ](H M i S E,r)} 

~ Var j[n(n - l)]" 1 ^ D H [ vec T (H- 1 )^(D® 2 ^) H (X J - X,)] 

^ i-l-A 



H=H 



MISE,r 



where K = K * K — 2K. So denoting (p H (x) = D H [vec T (H.- 1 )® r (D® 2r K) H (x)] , by 
standard [/-statistics theory the previous variance is of the same order as 4n _1 (3i — Hq) + 



27 



2n 2 S 2 , where 

Si = E[¥> H (Xi - X 2 )^ H (Xi - X 3 ) T ] 
3 2 = E[ ¥>H (Xi - X 2 )^ H (Xi - X 2 ) T ] 
H = E[¥> H (Xi - X 2 )] E[<^h(Xi - X 2 )] T 

with H of the order of Hmise^j namely having all its entries of order 0{n~ 2 ^ d+2r+ ^). The 
following lemma provides an explicit expression for the function <£> H (a:) that will be helpful 
to evaluate H p ,p = 0, 1, 2. 

Lemma 2. The function <£> H (:c) can be explicitly expressed as (fn(x) = A(D® 2r K)n(x) + 
Bp H (a:) where the function p: M d — > M d2r+2 is given by p(x) = (I d 2r ®x®Id)D®( 2r+1S) K(x) 
and the matrices A = A(H) G Ai^y.^, B = B(H) G A4j2 X( pr+2 are defined as 

A = -i(vec T H " r ® vecH- 1 ) - r(vec T H® - ^ -1 ^ ® H®~ 2 ) 

B = -[vec T H®" r ® (H 1 / 2 ® H 1 / 2 + I d ® H)" 1 ] 

w/iere we understand that H®~ r = (H^ 1 )®*" = (H® r ) _1 . 

Proo/. Since vec T H®- r (D 02r ^T) H (a;) = IH]" 1 / 2 vec T H - r D® 2r ^(H^ 1 / 2 ^), its differential 
is decomposed into three terms 

d(vec T YL®- r (D® 2r K) n (x)) = ddHl -1 / 2 ) vec T H^-T^i^H" 1 / 2 ^ 

+ |H|- 1 / 2 (dvecH®- r ) T D® 2r ^(H- 1 / 2 a; ) 

+ |H|- l ' 2 vec T H^d(D® 2r if (Br 1 / 2 *)) . 



From Chacon and Duong (2010), the differentials involved in the first two terms can be 
expressed as 

d(|H|~ 1/2 ) = -i|H|- 1/2 (vec T H" 1 )dvecH and 
d(vecH®- r ) = -IYfvecH®-^- 1 ) ® H - 2 ]dvecH, 
where Y r is a matrix such that r^D® 2r = rD® 2r . For the third term, 

vec T H®- r d(D® 2r K{HT l / 2 x)) = vec T H®" r ((D® 2r D T )K) (H~ 1 / 2 a:)d(H~ 1 / 2 x) 

= D 02r+1 ^(H~ 1 / 2 a ; ) T (vecH®- r ® I d )d(H- 1 / 2 a;), 
since [D(D T )® 2r ] vec H®" r = vec (Lj[D(D T )® 2r ] vec H®" r ) = (vec T H " r ® Lj)D® 2r+1 . Fi- 



nally, using vecH from Chacon and Duong 



( |2010| ), it follows that <i(H _1 / 2 a;) = (> T ® Lj^vecH- 1 / 2 = — (s T H -1 / 2 ® ® H + 

vecH. Thus the derivative reads 

D H (vec T H®- r (D 02r K) H (cc)) = -^H]- 1 ' 2 ^ H 8 " r ® vecKT^D® 2 ^ (HP 1 / 2 :!;) 

- rlH)- 1 / 2 ^ H^" 1 ) ® H^" 2 )D^ 2r K(H- 1 / 2 S 

- |H|- 1 / 2 (H 1 / 2 ® H 1 / 2 + I d ® Yiy l {U.- 1 / 2 x ® I d ) 
x (vec T H®" r ® l d )D® 2r+1 K{Yi- l l 2 x). 



x 
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The central factors of the third term on the right hand side can be rewritten as 
(H 1 / 2 8 H 1 / 2 + I^H)- 1 ^- 1 / 2 ^ 8) I d )(vec T H®~ r 8 I d ) 

= (H 1 / 2 (8) H 1 / 2 + I d 8 H)- 1 (vec T H®" r 8 H" 1 ^ 8 I d ) 

= vec T H " r (8) [(H 1 / 2 8 H 1 / 2 + I rf 8 H) -1 (H _1 / 2 a; 8 

= [vec T H 8 " r (H 1 / 2 8 H 1 / 2 + I d <g> H)" 1 ]^ 8 Hr 1/2 x . 

as desired. 



□ 



We now return to the task of finding the asymptotic order of Ho, Hi and H2. For that, 
some preliminary notation is needed. For any real function a we denote its vector moment 
of order p as n p (a) = j Rd x® p a{x)dx. For instance, Chacon and Duong (2011) showed that 
Ho{K) = -1, n x (K) = ix 2 (K) = fi, 3 {K) = and n A {K) = Qm 2 {K) 2 S dA (vec l d )® 2 , where 
Sd,r denotes the symmetrizer matrix of order r (see Holmquist, 1985), defined as the (only) 
matrix such that pre-multiplying a Kronecker product of any r vectors in M. d by S^ r results 
in the average of all possible permutations of the r-fold product. We also introduce here 
the notation K m n for the commutation matrix of order mn x mn ( Magnus and Neudecker 



1979). 



So taking this into account, for the calculation of the asymptotic order of Hq, a fourth 



order Taylor expansion of D® 2r f(x — H 1 / 2 ^), in the form of Kollo and von Rosen (2005 
Theorem 1.4. 



or 



Chacon, Duong and Wand (2011), gives 



(D® 2r K) H * f(x) 

= J D® 2r K{z)f{x - U 1 / 2 z)dz 

= (H 1 / 2 )^ J K(z)D® 2r f(x - H^ 2 z)dz 



(H l/ 2) ®2r^Li)! f Kiz^r ® (z T H l / 2 )®P}D® 2r +Pf(x)dz 



p=0 
4 



(H l/ 2) ®2r [I d2r 8 (/X p (K) T (H 1 / 2 )^)] D® 2r+ " 

p=0 p ' 



/(*) 



r+4 



r ^f(x), 
8 vec T H)(I d 4 + 



= -(H 1 ' 2 )® 2r D® 2r /(a;) + ^{Kf^l 1 ' 2 )^ [l^ 8 ((vec T H)® 2 S dA )] D 

= _(H 1 /2)®2r D ®2r /(a;) + 1^(^)2 [(ftl^r ^ H) ®2] D ®2 

Therefore, since vec T H®" r (H 1 /2)®2r = vec T Irfr and D H (vecH)® 2 = (1^ 
K. d 2 d 2), we obtain 

<p n * f{x) = D H [ vec T U®- r (D® 2r K) n * f(x)] 

~ D H {\m 2 (K) 2 [vec T I d r 8 (vec T H)® 2 ] D® 2r+4 /(a:)} 

= \m 2 {K) 2 { vec T I d r 8 I* 8 vec T H) [1^ 8 (I* + K dV2 )]D® 2r+4 f(x 



lm 2 (K) 2 (vec T 1, 



vec T H)D® 2r+4 f(x). 
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Using this, 

E[¥> H (Xi - X 2 )] = J ip u *f(x)f(x)dx~ ±m 2 (K) 2 (vec T I d r ®l d2 ®vec T H)*l, 2r 



+4 



and Ho = 0(J d 2) vecHvec T H. 
Similarly, 

~ \m 2 (K) 4 (vec T I d r ®I d 2 ® vec T H) j J D® 2r+4 f(x)D® 2r+A f{x) T f(x)dx 

x ( vec I d r (gi I rf 2 (8) vec H) 
= 0(J d2 )vecHvec T H. 

Finally, note that E 2 = |H|- 1 / 2 E[(¥J¥J T ) H (Xi-X 2 )], where tp(x) = A(D® 2r K)(x)+Bp(x), 
which also depends on H through A and B. Besides, 

E[( W T ) H (X 1 -X 2 )] = IJ(^ T )(z)f(y)f(y + il 1 / 2 z)dydz^ R(f) J cp(z)cp(z) T dz 

which, in view of Lemma § leads to H 2 = 0{3#\H\-V 2 ) vecH®-( r+1 ) vec T H®-( r+1 ). 
Putting all these together, since every element of H M i S E,r is 0(n- 2 /( ci+2r+4 )), 

4,n- 1 (S 1 - H ) + 2n- 2 3 2 ~ O^n"^ 2 ^ 4 )) vec H M i S E,r vec T H M i S E,r 

and therefore vec(H C v,r - H M isE,r) = 0(J d 2n" d/(M+4r+8) ) vec H M iSE,r- 

A. 2 Convergence rate for the PI bandwidth 

Henceforth, in addition to (Al)-(A3) the following assumptions on the pilot kernel L and 
the pilot bandwidth G are made: 

(A4) L is a symmetric (i-variate density such that f xx T L(x) dx = m 2 (L)Li and all its 
partial derivatives up to order 2r + 4 are bounded, continuous and square integrable. 

(A5) G = G n is a sequence of bandwidth matrices such that all entries of re~ 1 |G|~ 1 / 2 (G _1 )® r+2 
and G tend to zero as n — > oo. 

To make use of Lemma [T] once more, notice that the difference between the MISE and 
its estimate is 

PI r (H) - MISE r (H) ~ (-l) r ^^(Vv + 4(G) - ^ 2r+4 ) T (vecV ® (vecH)® 2 ) 
so taking into account DH(vecH)® 2 = il d 2 ® vec T H)(I^4 + K^^) again, we come to 
D H [PL(H) - MISE r (H)] ~ (-l) r ^4^(vec T l d r ® 1# ® vec T H)(^ 2r+4 (G) - </> 2r+4 ), 
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so that the performance of Hpi, r is determined by the performance of ip 2r +4,(G) as an 
estimator of V , 2r+4- 



From Theorem 2 in Chacon and Duong (2010) the optimal pilot bandwidth G for 



2r+4l 



the estimator tj> 2r+4: (G) is of order n 2 /( d + 2r + G ) j leading to E[||i/> 2r+4 (G) - ip 
( n -4/(d+2r+6)^ andthen D H [PIr(H)-MISE r (H)] = P (n- 2 ^ d+2r+ ^ 3 d 2 ) vec H. So finally 
we arrive to vec(Hpi jr — Hmise.t-) = Op(n^ 2 ^ d+2r+ ^3 d 2) vccHmise by applying Lemmajlj 

A. 3 Convergence rate for the SCV bandwidth 



As in Chacon and Duong (2011 ), it can be shown that the function MISE r can be replaced for 



MISE2 r everywhere in the asymptotic analysis, since the difference between their respective 
minimizers is of relative order faster than n -1 / 2 , which is the fastest attainable rate in 



bandwidth selection (Hall and Marron, 1991) 



So to apply Lemma [T] it is also possible consider MISE2 r instead of MISE r , hence 
we focus on analyzing the difference SCV r (H) — MISE2 r (H) at H of the same order as 
Hmise.t-- To begin with, note that using a fourth order Taylor expansion of D® 2r Z(G _1 / 2 a;— 
G-V2hV2 z ) results in 

A H *D® 2r L G (x) 

A n (z)D® 2r L G (x- z)dz 



= icr^c-V^r I A(z)D® 2r L(G- 1 / 2 x-G- 1 / 2 U 1 / 2 z)dz 

~ IGl-^fQ-Vajisar^ (zlE f A(z)[I d 2r (z T H 1 / 2 G- 1 / 2 )^]D^ 2r +PL(G^ 2 x) dz 

= im 2 (i^) 2 |G|- 1 / 2 (G- 1 / 2 )® 2r [I d2l . ® (vec T H)® 2 ^- 1 / 2 )® 4 ^ 2 ^!^- 1 / 2 *) 
= lm 2 (K) 2 \G\-^ 2 [l d 2r ® (vec T H)® 2 ]^- 1 ^^ 4 ^® 2 ^ 4 !^- 1 / 2 *) 
= \m 2 (K) 2 [I d 2r ® (vec T H)® 2 ]D® 2r+4 L G (aO, 

where we have made use of the fact that /xo(A) = /*i(A) = A* 2 (A) = /-t 3 (A) = and 
/i 4 (A) = 6rri2(K) 2 S di 4 : (vec I d )® 2 , and that the entries of G _1 H tend to zero as a conse- 
quence of (A3) and (A5). 

This asymptotic approximation is then used to expand the terms in 



E[SCV r (H) -MISE2 r (H)] = (-l) r vec T I d r |n -1 A H * D® 2r Z G (0) 

+ (1 - n" 1 ) E [(A H * D^ 2r Z G )(X 1 - X 2 )] — J Ah * D® 2r f(x)f(x) dxl. 

Precisely, for the first term we have 

A H * D® 2r Z G (0) ~ \m 2 {K) 2 \G\- l ' 2 [l d 2r ® (vec T H)® 2 ](G" 1 / 2 )® 2 '- +4 D® 2 '- +4 Z(0), 
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and for the second term 

E[(A H *D^Z G )(X 1 -X 2 )] 

~ \m 2 {K?[l d 2r ® (vec T H)® 2 ] / / D® 2r+4 L G (x - y)f(x)f(y) dxdy 



lm 2 (K) 2 [I d 2r®(vec T nr 2 ] I 'J ' L G (x - y)D^+ 4 f(x)f(y) dxdy 
m 2 (K) 2 [I d 2r ® (vec T H)® 2 ] J j L{w) ^ [I d 2, +4 (w T G 1 / 2 )® p ] 



i 



x 



D® 2r+4+ rf(y)f(y)dwdy 



2 



= \m 2 {K) 2 [l d , r ® (vec T H)® 2 ] £ ® {^(Z) T (G^ 2 )^}]t/> 2r+4+p 

= \m 2 {K) 2 [l d 2r ® (vec T H) 02 ]^ 2r+4 + im 2 (-fC) 2 m 2 (L)[I d 2, ® (vec T H) 02 vec T G]^ 2r+6 , 

since Hq(L) = l,/i 1 (Z) = and /x 2 (X) = 2fi 2 (L) = 2m 2 (L)vecI d . Finally, noting that 
D® 2r i£n = (H _1//2 )® 2r (D® 2r X)H and making use of the previously obtained expansion for 
(D® 2r K) H * f, the third term is 

J A H *D® 2r f(x)f(x)dx = J D® 2r k H *f(x)f(x)dx + ip 2r 

~ \m 2 (K) 2 [l d 2r ® (vec T H)® 2 ] </> 2r+4 . 

Thus, 

E[SCV r (H) - MISE2 r (H)] ~ |m 2 (K) 2 n- 1 |Gr 1/2 [vec T I d r ® (vec T H )® 2 ](G- 1/2 )® 2r+4 D® 2r+4 Z(0) 

+ lm 2 (K) 2 m 2 (L)[vec T I d r ® (vec T H)® 2 vec T G]^ 2r+6 

Calculations in Section [3] give G is order ri - 2 /( 2r + d + 6 ) ) as f or the plug-in selector, so sub- 
stituting to this into the derivative of the previous equation yields 

E{D H [SCV r (H) -MISE2 r (H)]} = 0([n _1 |G| _1 / 2 (tr G) _r ~ 2 + tr G]J rf 2) vecH 

= 0(n~ 2 /( 2r+d+6 )j d2 )vecH. 

Lemma [l] shows that vec(Hscv> — Hmise,tO is asymptotically equivalent to Dh[SCV,. — 
MISE2 r ](HMisE,r)- Since it was stated in Section [i] that E [|| vec(Hgcv,r — HMisE,r)|| 2 ] is 
dominated by its squared bias term, then vec(Hscv> — Hmise,?-) = Op(n~ 2 ^ 2,+d+6 ' J^) vec Hmise.t-- 
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